{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# FinBERT Sentiment Classification for Long Text corpus with token size much larger than 512\n",
    "\n",
    "# [Link to my Youtube Video Explaining this whole Notebook](https://www.youtube.com/watch?v=WEAAs_0etJQ&list=PLxqBkZuBynVQEvXfJpq3smfuKq3AiNW-N&index=9)\n",
    "\n",
    "[![Imgur](https://imgur.com/f6RpLCQ.png)](https://www.youtube.com/watch?v=WEAAs_0etJQ&list=PLxqBkZuBynVQEvXfJpq3smfuKq3AiNW-N&index=9)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## First What is BERT?\n",
    "\n",
    "BERT stands for Bidirectional Encoder Representations from Transformers. The name itself gives us several clues to what BERT is all about.\n",
    "\n",
    "BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder encapsulates two sub-layers: a self-attention layer and a feed-forward layer.\n",
    "\n",
    "### There are two different BERT models:\n",
    "\n",
    "- BERT base, which is a BERT model consists of 12 layers of Transformer encoder, 12 attention heads, 768 hidden size, and 110M parameters.\n",
    "\n",
    "- BERT large, which is a BERT model consists of 24 layers of Transformer encoder,16 attention heads, 1024 hidden size, and 340 parameters.\n",
    "\n",
    "\n",
    "\n",
    "BERT Input and Output\n",
    "BERT model expects a sequence of tokens (words) as an input. In each sequence of tokens, there are two special tokens that BERT would expect as an input:\n",
    "\n",
    "- [CLS]: This is the first token of every sequence, which stands for classification token.\n",
    "- [SEP]: This is the token that makes BERT know which token belongs to which sequence. This special token is mainly important for a next sentence prediction task or question-answering task. If we only have one sequence, then this token will be appended to the end of the sequence.\n",
    "\n",
    "\n",
    "It is also important to note that the maximum size of tokens that can be fed into BERT model is 512. If the tokens in a sequence are less than 512, we can use padding to fill the unused token slots with [PAD] token. If the tokens in a sequence are longer than 512, then we need to do a truncation.\n",
    "\n",
    "And that’s all that BERT expects as input.\n",
    "\n",
    "BERT model then will output an embedding vector of size 768 in each of the tokens. We can use these vectors as an input for different kinds of NLP applications, whether it is text classification, next sentence prediction, Named-Entity-Recognition (NER), or question-answering.\n",
    "\n",
    "\n",
    "------------\n",
    "\n",
    "**For a text classification task**, we focus our attention on the embedding vector output from the special [CLS] token. This means that we’re going to use the embedding vector of size 768 from [CLS] token as an input for our classifier, which then will output a vector of size the number of classes in our classification task.\n",
    "\n",
    "-----------------------\n",
    "\n",
    "![Imgur](https://imgur.com/NpeB9vb.png)\n",
    "\n",
    "-------------------------"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import BertForSequenceClassification, BertTokenizer\n",
    "import torch\n",
    "\n",
    "tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')\n",
    "model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source - https://www.brookings.edu/2022/08/11/the-future-of-crypto-regulation-highlights-from-the-brookings-event/\n",
    "\n",
    "txt = \"\"\"\n",
    "1. HIGHER LEVELS OF RETAIL PARTICIPATION IN CRYPTO THAN TRADITIONAL COMMODITY MARKETS POSE UNIQUE CHALLENGES FOR REGULATORS.\n",
    "\n",
    "One in five Americans report having traded cryptocurrency, and polls suggest crypto trading is more common among younger adults, men, and racial minorities. This is quite different from other financial instruments regulated by the CFTC, Benham noted. “You’re going to have more vulnerable investors… It’s incumbent on us to educate, to inform, to disclose risks involved.”\n",
    "\n",
    "Michael Piwowar, a former Securities and Exchange Commissioner and now executive director of the Milken Institute Center for Financial Markets, linked increased Congressional attention to growth in retail crypto: “If you got one in five households that have interacted with crypto… [members of Congress] are going to start hearing it from their constituents.” Legislation to regulate digital assets has been introduced by Senators Lummis and Gillibrand, Stabenow and Boozman, and Toomey, as well as Representative Gottheimer. The Treasury is actively negotiating bipartisan stablecoin legislation with House Financial Services Committee Chair Waters and Ranking Member McHenry. Benham said that stablecoins, digital currency meant to always be equal to one dollar, are more of a “payment mechanism” and thus should be regulated by prudential banking regulators.\n",
    "\n",
    "Digital asset regulation may require addressing crypto exchanges and digital wallets. American University Law Professor Hilary Allen noted that the stablecoin legislation under discussion does not, saying, “That is a gaping hole… Almost every major stablecoin… is affiliated with an exchange that profits from trading in that stablecoin.” Mark Wetjen, a former CFTC commissioner and current head of policy and regulatory strategy for FTX (one of the largest crypto exchanges), agreed: “The exchanges are the gateways to the entire crypto space, and so oversight of them is probably most important.” He pushed back that there was no current regulation, noting the requirement for state level licenses, such as New York’s Bitlicense: “If you want to list derivatives on bitcoin, for example, you need a license… so it may not be as dire a situation.”\n",
    "\n",
    "2. CRYPTO CHALLENGES TRADITIONAL REGULATORY DISTINCTION BETWEEN SECURITIES AND COMMODITIES.\n",
    "\n",
    "Traditionally, the SEC regulates securities while the CFTC regulates commodities and derivatives. Whether crypto is a security or commodity remains unclear, as various subcomponents of the crypto ecosystem challenge existing regulatory divisions. For instance, the SEC recently argued  that nine different crypto tokens were securities in an insider trading case while a federal judge ruled that virtual currency like Bitcoin constitutes a commodity.\n",
    "\n",
    "Benham called on Congress to provide clarity on which of the hundreds – if not thousands – of coins in existence are securities versus commodities: “Ultimately, we’d like to see law drawing lines.” Piwowar said the lack of clarity creates unwelcome delays as many crypto-related applications before the SEC are “not getting answers” on whether their products represent securities. The result is that some crypto firms are “going outside the United States” to locate their business. Allen cautioned, though, that Congressional action could also constitute an indication that the government supports crypto. She warned against letting crypto into the regulated sphere for fear of giving it “implicit guarantees.”\n",
    "\n",
    "A solution to the regulatory turf battle could be merging the SEC and CFTC, which Piwowar endorsed, as have many others. Congress, however, has shown little appetite to do so given the different Congressional committee jurisdictions involved.\n",
    "\n",
    "3. CFTC WILL RESTRUCTURE TO BETTER PROTECT CONSUMERS AND MORE EFFECTIVELY REGULATE MARKETS.\n",
    "Benham announced several changes at the CFTC during the Brookings event. First, LabCFTC will become the Office of Technology Innovation, reporting directly to the Chairman’s office. Behnam justified this by stating, “We are past the incubator stage, and digital assets and decentralized financial technologies have outgrown their sandboxes.” Second, CFTC’s Office of Customer Education and Outreach will be realigned within the Office of Public Affairs, which Behnam said would “leverage resources and a broader understanding of the issues facing the general public towards addressing the most critical needs in the most vulnerable communities.” Restructuring within a regulator may appear a bureaucratic shuffle but can reflect changes in internal power, agency focus, and prioritization. Directly reporting to the chair increases an office’s authority and prestige.\n",
    "\n",
    "4. IS CRYPTO A PASSING FAD (OR WORSE, A BUBBLE THAT THREATENS FINANCIAL MARKETS)?\n",
    "Allen argued that crypto is “purposely less efficient and more complicated than a more centralized system,” and does not have any societal value. FTX’s Wetjen disagreed: “The difference here with blockchain as the underpinning means by which you can transfer value is that there are absolutely no gates.” Piwowar broadly agreed with Wetjen that “We’re going to have the new generation of Amazons and Googles come out of this stuff,” but cautioned that while he was at the SEC, “Nine out of ten [crypto applications] were outright fraud, and then out of the one out of ten, nine out of ten of those were probably fraud.” Since January 2021, over 46,000 people have collectively lost over $1 billion to scams involving crypto.\n",
    "\n",
    "Everyone wants to avoid a repeat of the 2008 global financial crisis. To do so, regulators have focused on avoiding and mitigating “systemic risk” to the financial system. Asked if he sees a “clear and present danger to the existing economic system,” Benham said he did not, pointing out that crypto is not sufficiently interconnected to pose systemic risk. He noted the decrease in crypto values over the past several months did not cause ripples in the financial system or the broader economy. Piwowar turned the question of systemic risk back onto the actions of financial regulators asking: “What is systemic risk?  It’s the risk that a federal policymaker is going to bail out a bank, either directly or indirectly.” Allen agreed that bailing out crypto would be a mistake quipping: “If anything should be able to fail, it should be crypto, which isn’t… funding productive economic capacity.”\n",
    "\n",
    "Allen also noted the similarity in arguments centered on American global competitiveness which promoted lax regulation for derivatives: “It’s almost identical to the rhetoric we saw around swaps in the 1990s.” Credit default swaps, like crypto now, faced loose regulation and ultimately helped fuel the subprime mortgage crisis. Behnam noted that one of 2008’s biggest lessons was the need for the CFTC to promote market transparency in the “OTC [over-the-counter] derivative space.” Crypto proponents point to the underlying technology as being inherently more transparent, while critics point to the lack of understanding of aspects of the market, such as what assets back stablecoins like Tether.\n",
    "\n",
    "5. DOES CRYPTO INCREASE FINANCIAL INCLUSION?\n",
    "Cryptocurrency proponents frequently cite financial inclusion as a major benefit linking the higher usage of youth and communities of color who have higher rates of being unbanked or underbanked by traditional finance. Allen cautioned against “predatory inclusion” arguing that, “Because there’s no productive capacity behind them, their value derives from finding someone else to buy them from you.” Wetjen’s responded, blending his experience serving as a CFTC Commissioner with his time in the crypto industry: “From my own experience… at the CFTC, there’s plenty of authority that’s already in place for the agency to… be pretty thoughtful and relatively prescriptive, even in terms of what actually should be disclosed to, particularly, retail investors, or users of a platform such as FTX.” He argued that the right policy is “giving people the opportunity to be involved and invest in the space that they like but making sure that it’s done with the right safeguards.”\n",
    "\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'input_ids': tensor([[1015, 1012, 3020,  ..., 2015, 1012, 1524]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokens = tokenizer.encode_plus(txt, add_special_tokens = False, return_tensors = 'pt')\n",
    "\n",
    "print(len(tokens))\n",
    "\n",
    "tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1653"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(tokens['input_ids'][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_id_chunks = tokens['input_ids'][0].split(510)\n",
    "attention_mask_chunks = tokens['attention_mask'][0].split(510)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(attention_mask_chunks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "510"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(attention_mask_chunks[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([510])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "input_id_chunks[0].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_input_ids_and_attention_mask_chunk():\n",
    "    \"\"\"\n",
    "    This function splits the input_ids and attention_mask into chunks of size 'chunksize'. \n",
    "    It also adds special tokens (101 for [CLS] and 102 for [SEP]) at the start and end of each chunk.\n",
    "    If the length of a chunk is less than 'chunksize', it pads the chunk with zeros at the end.\n",
    "    \n",
    "    Returns:\n",
    "        input_id_chunks (List[torch.Tensor]): List of chunked input_ids.\n",
    "        attention_mask_chunks (List[torch.Tensor]): List of chunked attention_masks.\n",
    "    \"\"\"\n",
    "    chunksize = 512\n",
    "    input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))\n",
    "    attention_mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))\n",
    "    \n",
    "    for i in range(len(input_id_chunks)):\n",
    "        input_id_chunks[i] = torch.cat([\n",
    "            torch.tensor([101]), input_id_chunks[i], torch.tensor([102])\n",
    "        ])\n",
    "        \n",
    "        attention_mask_chunks[i] = torch.cat([\n",
    "            torch.tensor([1]), attention_mask_chunks[i], torch.tensor([1])\n",
    "        ])\n",
    "        \n",
    "        pad_length = chunksize - input_id_chunks[i].shape[0]\n",
    "        \n",
    "        if pad_length > 0:\n",
    "            input_id_chunks[i] = torch.cat([\n",
    "                input_id_chunks[i], torch.Tensor([0] * pad_length)\n",
    "            ])\n",
    "            attention_mask_chunks[i] = torch.cat([\n",
    "                attention_mask_chunks[i], torch.Tensor([0] * pad_length)\n",
    "            ])\n",
    "            \n",
    "    return input_id_chunks, attention_mask_chunks "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_id_chunks, attention_mask_chunks = get_input_ids_and_attention_mask_chunk()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input_ids': tensor([[  101,  1015,  1012,  ...,  1010,  1996,   102],\n",
       "         [  101, 10819,  3728,  ...,  1521,  2128,   102],\n",
       "         [  101,  2183,  2000,  ...,  2078,  1521,   102],\n",
       "         [  101,  1055,  5838,  ...,     0,     0,     0]]),\n",
       " 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],\n",
       "         [1, 1, 1,  ..., 1, 1, 1],\n",
       "         [1, 1, 1,  ..., 1, 1, 1],\n",
       "         [1, 1, 1,  ..., 0, 0, 0]], dtype=torch.int32)}"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "input_ids = torch.stack(input_id_chunks)\n",
    "attention_mask = torch.stack(attention_mask_chunks)\n",
    "\n",
    "input_dict = {\n",
    "    'input_ids' : input_ids.long(),\n",
    "    'attention_mask' : attention_mask.int()\n",
    "}\n",
    "\n",
    "input_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([4, 512])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "input_dict['input_ids'].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([4, 512])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "input_dict['attention_mask'].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([0.0767, 0.1188, 0.8045], grad_fn=<MeanBackward1>)"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outputs = model(**input_dict)\n",
    "\n",
    "probabilities = torch.nn.functional.softmax(outputs[0], dim = -1 )\n",
    "\n",
    "mean_probabilities = probabilities.mean(dim = 0)\n",
    "\n",
    "mean_probabilities\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch.argmax(mean_probabilities).item()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.10 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
